A major proportion of retail bank profit comes from interests in the form of home loans. These loans are borrowed by regular income/high-earning customers. Banks are most fearful of defaulters, as bad loans (NPA) usually eat up a major chunk of their profits. Therefore, it is important for banks to be judicious while approving loans for their customer base. The approval process for the loans is multifaceted. Through this process, the bank tries to check the creditworthiness of the applicant on the basis of a manual study of various aspects of the application. The entire process is not only effort-intensive but also prone to wrong judgment/approval owing to human error and biases. There have been attempts by many banks to automate this process by using heuristics. But with the advent of data science and machine learning, the focus has shifted to building machines that can learn this approval process and make it free of biases and more efficient. At the same time, one important thing to keep in mind is to make sure that the machine does not learn the biases that previously crept in because of the human approval process..
A bank's consumer credit department aims to simplify the decision-making process for home equity lines of credit to be accepted. To do this, they will adopt the Equal Credit Opportunity Act's guidelines to establish an empirically derived and statistically sound model for credit scoring. The model will be based on the data obtained via the existing loan underwriting process from recent applicants who have been given credit. The model will be built from predictive modeling techniques, but the model created must be interpretable enough to provide a justification for any adverse behavior (rejections).
Build a classification model to predict clients who are likely to default on their loan and give recommendations to the bank on the important features to consider while approving a loan.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. There are 12 input variables registered for each applicant.
● BAD: 1 = Client defaulted on loan, 0 = loan repaid
● LOAN: Amount of loan approved
● MORTDUE: Amount due on the existing mortgage
● VALUE: Current value of the property
● REASON: Reason for the loan request (HomeImp = home improvement, DebtCon= debt
consolidation which means taking out a new loan to pay off other liabilities and consumer
debts)
● JOB: The type of job that loan applicant has such as manager, self, etc.
● YOJ: Years at present job
● DEROG: Number of major derogatory reports (which indicates serious delinquency or late
payments).
● DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a
borrower does not make the minimum required payments 30 to 60 days past the day on
which the payments were due)
● CLAGE: Age of the oldest credit line in months
● NINQ: Number of recent credit inquiries
● CLNO: Number of existing credit lines
● DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly
income. This number is one of the ways lenders measure a borrower’s ability to manage the
monthly payments to repay the money they plan to borrow)
#Using tqdm to show progress bar
! pip install tqdm
Requirement already satisfied: tqdm in c:\users\sheidu omuya yusuf\anaconda3\lib\site-packages (4.65.0) Requirement already satisfied: colorama in c:\users\sheidu omuya yusuf\anaconda3\lib\site-packages (from tqdm) (0.4.6)
Import Necessary Libraries
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model using statsmodels
import statsmodels.api as sm
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error
#To ignore unecessary errors
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn import linear_model,svm
from imblearn.over_sampling import SMOTE
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import sklearn.linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore", ConvergenceWarning)
# this will help in making the Python code more structured automatically (help adhere to good coding practices)
#%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import make_scorer
conda install -c conda-forge scikit-learn
Note: you may need to restart the kernel to use updated packages.
usage: conda-script.py [-h] [--no-plugins] [-V] COMMAND ... conda-script.py: error: unrecognized arguments: scikit-learn
# To get diferent metric scores
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
#Access and read the dataset
loan_predict=pd.read_csv("hmeq.csv")
#Make a copy of the original dataset
df=loan_predict.copy()
#view top 5 row of the dataset
df.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.00000 | 39025.00000 | HomeImp | Other | 10.50000 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 9.00000 | NaN |
| 1 | 1 | 1300 | 70053.00000 | 68400.00000 | HomeImp | Other | 7.00000 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 14.00000 | NaN |
| 2 | 1 | 1500 | 13500.00000 | 16700.00000 | HomeImp | Other | 4.00000 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 10.00000 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.00000 | 112000.00000 | HomeImp | Office | 3.00000 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 14.00000 | NaN |
#view last 5 row of the dataset
df.tail()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.00000 | 90185.00000 | DebtCon | Other | 16.00000 | 0.00000 | 0.00000 | 221.80872 | 0.00000 | 16.00000 | 36.11235 |
| 5956 | 0 | 89000 | 54576.00000 | 92937.00000 | DebtCon | Other | 16.00000 | 0.00000 | 0.00000 | 208.69207 | 0.00000 | 15.00000 | 35.85997 |
| 5957 | 0 | 89200 | 54045.00000 | 92924.00000 | DebtCon | Other | 15.00000 | 0.00000 | 0.00000 | 212.27970 | 0.00000 | 15.00000 | 35.55659 |
| 5958 | 0 | 89800 | 50370.00000 | 91861.00000 | DebtCon | Other | 14.00000 | 0.00000 | 0.00000 | 213.89271 | 0.00000 | 16.00000 | 34.34088 |
| 5959 | 0 | 89900 | 48811.00000 | 88934.00000 | DebtCon | Other | 15.00000 | 0.00000 | 0.00000 | 219.60100 | 0.00000 | 16.00000 | 34.57152 |
print('The number of rows (observations) is', df.shape[0],'\n''The number of columns (variables) is',df.shape[1])
from tqdm import tqdm
for i in tqdm (range (100), desc="Loading..."):
pass
The number of rows (observations) is 5960 The number of columns (variables) is 13
Loading...: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<?, ?it/s]
# Understand the shape of the data
df.shape
(5960, 13)
# To print the essential information about the data
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
#Count of number of client that repaid /defaulted
df["BAD"].value_counts()
0 4771 1 1189 Name: BAD, dtype: int64
target = df['BAD'].value_counts()
#labels = ['No default', 'Defaulted']
#sizes = [80, 20] # Percentages (e.g., 80% No default, 20% Defaulted)
# Colors for the slices
colors = ['#37FD12', 'red']
# Labels the features
mylabels = ['No default', 'Defaulted']
# Create the 3D pie chart
plt.pie(target, colors=colors, labels=mylabels, explode=[0, 0.1], autopct='%1.1f%%', startangle=0, labeldistance=1.2, pctdistance=0.6, shadow=True)
# Add a title
plt.title('Loan Default Status')
# Display the chart
plt.axis('equal') # Equal aspect ratio ensures a circular pie chart
plt.show()
#Statistical summary of the dataset
df.describe(include = 'number').T.round(2)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| BAD | 5960.00000 | 0.20000 | 0.40000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| LOAN | 5960.00000 | 18607.97000 | 11207.48000 | 1100.00000 | 11100.00000 | 16300.00000 | 23300.00000 | 89900.00000 |
| MORTDUE | 5442.00000 | 73760.82000 | 44457.61000 | 2063.00000 | 46276.00000 | 65019.00000 | 91488.00000 | 399550.00000 |
| VALUE | 5848.00000 | 101776.05000 | 57385.78000 | 8000.00000 | 66075.50000 | 89235.50000 | 119824.25000 | 855909.00000 |
| YOJ | 5445.00000 | 8.92000 | 7.57000 | 0.00000 | 3.00000 | 7.00000 | 13.00000 | 41.00000 |
| DEROG | 5252.00000 | 0.25000 | 0.85000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| DELINQ | 5380.00000 | 0.45000 | 1.13000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 15.00000 |
| CLAGE | 5652.00000 | 179.77000 | 85.81000 | 0.00000 | 115.12000 | 173.47000 | 231.56000 | 1168.23000 |
| NINQ | 5450.00000 | 1.19000 | 1.73000 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 17.00000 |
| CLNO | 5738.00000 | 21.30000 | 10.14000 | 0.00000 | 15.00000 | 20.00000 | 26.00000 | 71.00000 |
| DEBTINC | 4693.00000 | 33.78000 | 8.60000 | 0.52000 | 29.14000 | 34.82000 | 39.00000 | 203.31000 |
df.describe(include = 'object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| REASON | 5708 | 2 | DebtCon | 3928 |
| JOB | 5681 | 6 | Other | 2388 |
#find missing values
df.isnull().sum()
BAD 0 LOAN 0 MORTDUE 518 VALUE 112 REASON 252 JOB 279 YOJ 515 DEROG 708 DELINQ 580 CLAGE 308 NINQ 510 CLNO 222 DEBTINC 1267 dtype: int64
# check for duplicated data
df[df.duplicated()].count()
BAD 0 LOAN 0 MORTDUE 0 VALUE 0 REASON 0 JOB 0 YOJ 0 DEROG 0 DELINQ 0 CLAGE 0 NINQ 0 CLNO 0 DEBTINC 0 dtype: int64
plt.figure(figsize = (12,8))
sns.heatmap(df.isnull(), cbar = False, cmap = 'coolwarm', yticklabels = False)
plt.show() #
df.isnull().sum().sort_values(ascending = False)/df.index.size
DEBTINC 0.21258 DEROG 0.11879 DELINQ 0.09732 MORTDUE 0.08691 YOJ 0.08641 NINQ 0.08557 CLAGE 0.05168 JOB 0.04681 REASON 0.04228 CLNO 0.03725 VALUE 0.01879 BAD 0.00000 LOAN 0.00000 dtype: float64
def univar_vis(fd): # Define univariate visualization function
title = fd.name
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
fig.suptitle(title.upper() + " " + "Distribution")
sns.distplot(fd, color="green", bins=5, ax=axes[0, 0], hist=None)
sns.boxplot(fd, ax=axes[0, 1])
sns.violinplot(fd, ax=axes[1, 0])
sns.histplot(fd, ax=axes[1, 1])
axes[0, 0].axvline(fd.mean(), color="black", linewidth=0.7)
axes[0, 0].axvline(fd.median(), color="red", linewidth=0.3)
axes[0, 1].axvline(fd.median(), color="red", linewidth=0.9)
axes[0, 1].axvline(fd.mean(), color="purple", linewidth=0.7)
axes[1, 0].axvline(fd.mean(), color="purple", linewidth=0.7)
axes[1, 0].axvline(fd.median(), color="green", linewidth=0.7)
axes[1, 1].axvline(fd.mean(), color="purple", linewidth=0.7)
axes[1, 1].axvline(fd.median(), color="green", linewidth=0.7)
plt.tight_layout()
plt.show()
univar_vis(df["DEBTINC"])
univar_vis(df["DEROG"])
univar_vis(df["DELINQ"])
univar_vis(df["MORTDUE"])
univar_vis(df["YOJ"])
univar_vis(df["NINQ"])
univar_vis(df["CLAGE"])
univar_vis(df["CLNO"])
univar_vis(df["VALUE"])
univar_vis(df["BAD"])
univar_vis(df["LOAN"])
# Display columns
df.columns
Index(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'],
dtype='object')
num_col = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
cat_coln = [ 'REASON', 'JOB',]
sns.countplot(x ='JOB', data=df, hue='BAD') #Job vs default /repaid
plt.show()
sns.countplot(x ='REASON', data=df, hue='BAD') # Reason for loan vs default/repaid
plt.show()
fig,axes = plt.subplots(5,2,figsize=(12,15))
for idx,cat_col in enumerate(num_col):
row,col = idx//2,idx%2
sns.boxplot(y=cat_col,data=df,x='BAD',ax=axes[row,col])
plt.subplots_adjust(hspace=1)
# pair plot showing relationsh amoung variables
sns.pairplot(data=df, hue='BAD', corner=True)
from time import sleep
from tqdm import tqdm
for i in tqdm (range (10)):
sleep(3)
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00, 3.00s/it]
# heatmap
plt.figure(figsize=(12,12))
sns.heatmap(df[num_col].corr(),annot=True)
<Axes: >
dfl = df.groupby(["JOB"])[["LOAN"]].mean()
dfl
| LOAN | |
|---|---|
| JOB | |
| Mgr | 19155.28031 |
| Office | 18142.61603 |
| Other | 18061.68342 |
| ProfExe | 18983.46395 |
| Sales | 14913.76147 |
| Self | 28314.50777 |
ax = sns.catplot(
x="JOB", y="LOAN", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="JOB TYPE vs. AVERAGE LOAN AMOUNT");
dfl2= df.groupby(["REASON"])[["LOAN"]].mean()
dfl2
| LOAN | |
|---|---|
| REASON | |
| DebtCon | 19952.95316 |
| HomeImp | 16006.62921 |
ax = sns.catplot(
x="REASON", y="LOAN", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=0).set(
title="REASON FOR LOAN WITH AVERAGE LOAN AMOUNT");
ax = sns.catplot(
x="REASON", y="MORTDUE", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=0).set(
title="REASON FOR LOAN vs AMOUNT DUE ON EXISTING MORTGAGE");
ax = sns.catplot(
x="JOB", y="MORTDUE", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=0).set(
title="JOB TYPE vs. AMOUNT DUE ON EXISTING MORTGAGE");
ax = sns.catplot(
x="JOB", y="DELINQ", data=df, kind="bar", height=4.5, aspect=3
) # Applicant's job type vs average number of delinquent credit line
ax.set_xticklabels(rotation=00).set(
title="LOAN APPLICANT'S JOB WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
x = sns.catplot(
x="REASON", y="DELINQ", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="REASON FOR LOAM WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
x = sns.catplot(
x="JOB", y="DEBTINC", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="LOAN APPLICANT'S JOB FOR WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
x = sns.catplot(
x="REASON", y="DEBTINC", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="LOAN APPLICANT'S JOB FOR WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
x = sns.catplot(
x="REASON", y="CLNO", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="LOAN APPLICANT'S JOB FOR WITH AVERAGE NUMBER OF EXISTING CREDIT LINES");
x = sns.catplot(
x="JOB", y="CLNO", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
title="LOAN APPLICANT'S JOB FOR WITH AVERAGE NUMBER OF EXISTING CREDIT LINES");
Defaults = df.copy()
print(Defaults["REASON"].value_counts())
print(Defaults["JOB"].value_counts())
DebtCon 3928 HomeImp 1780 Name: REASON, dtype: int64 Other 2388 ProfExe 1276 Office 948 Mgr 767 Self 193 Sales 109 Name: JOB, dtype: int64
The scale of each attribute is different, we need to normalize all the features. Some attributes have skewed distribution Some attributes have a lot of outliers (DEBTINC, LOAN, MORTDUE, VALUE) For normalizing, we can use Min-Max Scaler, but attributes like LOAN, MORTDUE, VALUE have a lot of outliers (from boxplot), so we will also try Z-Score Normalization (preferred). For fixing the skewness, we need to transform the attributes. Our basic transformation did improve the distribution of some attributes like LOAN, MORTDUE and VALUE.
Numerosity Reduction : Apart from the above needed steps, many tuple/observations might have many missing values in their attributes. We can consider dropping them to improve the data quality. For this we need to decide a threshold value, such that the data quality is also improved and a lot of data isn't lost. Feature Reduction - Dropping columns with same value for most of the observations (DELINQ and DEROG), and after considering their Correlation and Predictive Power Score(REASON and JOB).
Plotting Heatmap of correlation Matrix, to understand the type of linear relation between attributes. We will again plot Heatmap of correlation after cleaning and transforming the attributes.
Defaults["PROBINC"] = Defaults.MORTDUE/Defaults.DEBTINC # adding new feature, (current debt on mortgage)/(debt to income ratio). this feature helps to evaluate the financial stability and t
from scipy.stats import yeojohnson
Defaults_temp = Defaults.copy()
Defaults_temp["LOAN"] = yeojohnson(Defaults["LOAN"])[0] # transforming LOAN using yeo-johnson method
Defaults1 = Defaults_temp.copy()
Defaults_temp["MORTDUE"] = np.power(Defaults["MORTDUE"],1/8) # transforming MORTDUE by raising it to 1/8
Defaults_temp["YOJ"] = np.log(Defaults["YOJ"]+10)
Defaults_temp["VALUE"] = np.log(Defaults["VALUE"]+10)
Defaults_temp["CLNO"] = np.log(Defaults["CLNO"]+10)
Defaults2 = Defaults_temp.copy()
Defaults2.columns
Index(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC'],
dtype='object')
# Checking Outliers in dataset
col_names = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC']
col_names = list(col_names)
fig, ax = plt.subplots(len(col_names), figsize=(8,50))
for i, col_val in enumerate(col_names):
sns.boxplot(y=Defaults2[col_val], ax=ax[i])
ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
ax[i].set_xlabel(col_val, fontsize=8)
plt.show()
col = ['BAD','REASON','JOB']
df_X = Defaults2.drop(col, axis= 1)
df_Y = df[col]
def treat_outlier(x): #Outlier treatment
# taking 5,25,75 percentile of column
q5= np.percentile(x,5)
q25=np.percentile(x,25)
q75=np.percentile(x,75)
dt=np.percentile(x,95)
#calculationg IQR range
IQR=q75-q25
#Calculating minimum threshold
lower_bound=q25-(1.5*IQR)
upper_bound=q75+(1.5*IQR)
#Calculating maximum threshold
#print(q5,q25,q75,dt,min,max)
#Capping outliers
return x.apply(lambda y: dt if y > upper_bound else y).apply(lambda y: q5 if y < lower_bound else y)
for i in df_X:
df_X[i]=treat_outlier(df_X[i])
df = pd.concat([df_X, df_Y], axis = 1)
from sklearn.impute import SimpleImputer
# Identify columns with missing values
columns_with_missing = df.columns[df.isnull().any()]
# Impute numerical columns with mean
numerical_cols = df.select_dtypes(include='number').columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())
# Impute categorical columns with mode
categorical_cols = df.select_dtypes(include='object').columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])
# Verify if there are still any missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")
Remaining missing values: 0
The data has now been prepared and ready for the next step of model building. We will encode the categorical variable then proceed to built the models
# INSIGHTS
Loan Approval and Defaults: The dataset comprises 5,960 observations, with a 20% default rate. The average approved loan amount is ~18,608, and the maximum is 89,900. Clients with relatively high loan amounts seem to have repaid successfully.
Mortgage and Property Values: Average due on existing mortgages is ~737,609, with a maximum of 855,909. Higher current property values are associated with a higher default rate.
Debt-to-Income Ratio: The average debt-to-income ratio is 33.77, within a favorable range. Higher debt-to-income ratios are associated with a higher default rate.
Credit Lines and Enquiries: The average number of existing credit lines is 21. Higher numbers of derogatory remarks, delinquent credit lines, and credit inquiries are associated with a higher default rate.
Reasons for Loan and Job Types: Debt consolidation is the most common reason for a loan, with the highest num Clients in the "Others" job category have the highest default rate.
Distribution and Outliers:
Several variables are not normally distributed and exhibit right skewness, indicating the presence of outliers.
Correlations: MORTDUE (amount due on existing mortgage) is highly correlated with VALUE (current property value).
Job Types and Loan Amounts: Self-employed clients have the highest average loan amount, while sales professionals have the least.
Debt-to-Income Ratio and Job Types: Sales professionals have the highest average debt-to-income ratio.
Number of Credit Lines and Job Types:
Sales professionals have the highest average number of existing credit lines.
# RECOMMENDATIONS
# KEY POINTS
Prioritize a thorough evaluation of clients with characteristics linked to higher default rates. Tailor loan offerings based on job types, considering the observed differences in loan amounts and default rates. Implement stringent risk assessment for clients in the "Others" job category. Leverage the correlation between MORTDUE and VALUE for potential refinancing opportunities
df.dtypes
LOAN float64 MORTDUE float64 VALUE float64 YOJ float64 DEROG float64 DELINQ float64 CLAGE float64 NINQ float64 CLNO float64 DEBTINC float64 PROBINC float64 BAD int64 REASON object JOB object dtype: object
from sklearn.impute import SimpleImputer
# Identify columns with missing values
columns_with_missing = df.columns[df.isnull().any()]
# Impute numerical columns with mean
numerical_cols = df.select_dtypes(include='number').columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())
# Impute categorical columns with mode
categorical_cols = df.select_dtypes(include='object').columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])
# Verify if there are still any missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")
Remaining missing values: 0
df.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON | JOB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | DebtCon | Other |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | HomeImp | Office |
#Let's identify the categorical features
df.columns[df.dtypes == object]
Index(['REASON', 'JOB'], dtype='object')
We want to predict clients who are likely to default on their loan. Before we proceed to build a model, we'll have to encode categorical features. We'll split the data into train and test to be able to evaluate the model that we build on the train data.
X = df.drop(["BAD"], axis=1)
Y = df["BAD"]
# adding constant
X = sm.add_constant(X)
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (4172, 18) Shape of test set : (1788, 18) Percentage of classes in training set: 0 0.80417 1 0.19583 Name: BAD, dtype: float64 Percentage of classes in test set: 0 0.79195 1 0.20805 Name: BAD, dtype: float64
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: BAD No. Observations: 4172
Model: Logit Df Residuals: 4154
Method: MLE Df Model: 17
Date: Fri, 22 Dec 2023 Pseudo R-squ.: 0.2415
Time: 16:45:11 Log-Likelihood: -1565.1
converged: True LL-Null: -2063.3
Covariance Type: nonrobust LLR p-value: 5.014e-201
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
const 3.6234 1.271 2.851 0.004 1.132 6.115
LOAN -0.1184 0.024 -4.925 0.000 -0.165 -0.071
MORTDUE -0.8414 0.240 -3.501 0.000 -1.313 -0.370
VALUE -0.0365 0.143 -0.256 0.798 -0.316 0.243
YOJ -0.0777 0.136 -0.573 0.567 -0.343 0.188
DEROG 0.6250 0.059 10.656 0.000 0.510 0.740
DELINQ 0.7335 0.046 16.061 0.000 0.644 0.823
CLAGE -0.0047 0.001 -7.182 0.000 -0.006 -0.003
NINQ 0.1667 0.025 6.597 0.000 0.117 0.216
CLNO -0.6487 0.156 -4.167 0.000 -0.954 -0.344
DEBTINC 0.0931 0.009 10.572 0.000 0.076 0.110
PROBINC 0.0002 4.52e-05 3.610 0.000 7.46e-05 0.000
REASON_HomeImp 0.1210 0.106 1.144 0.253 -0.086 0.328
JOB_Office -0.5862 0.182 -3.219 0.001 -0.943 -0.229
JOB_Other -0.0562 0.141 -0.399 0.690 -0.333 0.220
JOB_ProfExe 0.0522 0.164 0.318 0.751 -0.270 0.374
JOB_Sales 0.6965 0.330 2.109 0.035 0.049 1.344
JOB_Self 0.5637 0.265 2.129 0.033 0.045 1.083
==================================================================================
Model evaluation criterion
Model can make wrong predictions as: Predicting a customer will not default but in reality, the customer will default to their loan obligation. Predicting a customer will default to their obligation but in reality, the customer will not default to their obligation.
Which case is more important?
Both the cases are important as:
If we predict that a default will not be occured and the default gets occurred then the bank will lose resources and will have to bear additional costs.
If we predict that a default will get occurred and the booking doesn't get occurred then the bank might not be able to provide satisfactory services to the customer by assuming that this default will be occured. This might damage the brand equity.
How to reduce the losses?
F1 Score to be maximized, greater the F1 score higher are the chances of minimizing False Negatives and False Positives.First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Observations
Negative values of the coefficient show that the probability of clients defaulting to their loan obligation decreases with the increase of the corresponding attribute value.
Positive values of the coefficient show that the probability of customer defaulting increases with the increase of corresponding attribute value.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
We will have to remove multicollinearity from the data to get reliable coefficients and p-values.
There are different ways of detecting (or testing) multi-collinearity, one such way is the Variation Inflation Factor.
import sklearn.metrics
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.85139 | 0.35373 | 0.75853 | 0.48247 |
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(X_train)
| feature | VIF | |
|---|---|---|
| 0 | const | 750.69990 |
| 1 | LOAN | 1.28254 |
| 2 | MORTDUE | 2.29975 |
| 3 | VALUE | 2.49333 |
| 4 | YOJ | 1.07623 |
| 5 | DEROG | 1.08064 |
| 6 | DELINQ | 1.07127 |
| 7 | CLAGE | 1.13811 |
| 8 | NINQ | 1.09370 |
| 9 | CLNO | 1.28833 |
| 10 | DEBTINC | 1.15834 |
| 11 | PROBINC | 1.24661 |
| 12 | REASON_HomeImp | 1.14039 |
| 13 | JOB_Office | 1.88896 |
| 14 | JOB_Other | 2.56109 |
| 15 | JOB_ProfExe | 2.14889 |
| 16 | JOB_Sales | 1.12629 |
| 17 | JOB_Self | 1.26024 |
Observations
The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = X_train[cols]
# fitting the model
model = sm.Logit(y_train, x_train_aux).fit(disp=False)
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', 'LOAN', 'MORTDUE', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC', 'JOB_Office', 'JOB_Sales', 'JOB_Self']
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: BAD No. Observations: 4172
Model: Logit Df Residuals: 4159
Method: MLE Df Model: 12
Date: Fri, 22 Dec 2023 Pseudo R-squ.: 0.2409
Time: 16:45:11 Log-Likelihood: -1566.2
converged: True LL-Null: -2063.3
Covariance Type: nonrobust LLR p-value: 3.256e-205
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2490 0.813 3.998 0.000 1.656 4.842
LOAN -0.1285 0.022 -5.851 0.000 -0.171 -0.085
MORTDUE -0.8471 0.196 -4.313 0.000 -1.232 -0.462
DEROG 0.6265 0.058 10.736 0.000 0.512 0.741
DELINQ 0.7309 0.046 16.057 0.000 0.642 0.820
CLAGE -0.0047 0.001 -7.246 0.000 -0.006 -0.003
NINQ 0.1645 0.025 6.573 0.000 0.115 0.214
CLNO -0.6584 0.151 -4.348 0.000 -0.955 -0.362
DEBTINC 0.0933 0.009 10.688 0.000 0.076 0.110
PROBINC 0.0002 4.53e-05 3.694 0.000 7.86e-05 0.000
JOB_Office -0.5569 0.144 -3.880 0.000 -0.838 -0.276
JOB_Sales 0.6921 0.307 2.257 0.024 0.091 1.293
JOB_Self 0.6223 0.236 2.636 0.008 0.160 1.085
==============================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.85163 | 0.35373 | 0.76053 | 0.48287 |
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg1, X_test1, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.82998 | 0.29301 | 0.72667 | 0.41762 |
# converting coefficients to odds
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
| const | LOAN | MORTDUE | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | JOB_Office | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 25.76339 | 0.87945 | 0.42864 | 1.87097 | 2.07704 | 0.99529 | 1.17879 | 0.51767 | 1.09780 | 1.00017 | 0.57299 | 1.99799 | 1.86326 |
| Change_odd% | 2476.33890 | -12.05525 | -57.13617 | 87.09705 | 107.70447 | -0.47108 | 17.87866 | -48.23295 | 9.77966 | 0.01673 | -42.70095 | 99.79913 | 86.32564 |
Checking model performance on the training set
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg1, X_train1, y_train
)
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.85163 | 0.35373 | 0.76053 | 0.48287 |
ROC-AUC
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Model Performance Improvement
Optimal threshold using AUC-ROC curve
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.19138084315264084
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.76103 | 0.68543 | 0.43077 | 0.52905 |
Let's check the performance on the test set
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75951 | 0.67742 | 0.44840 | 0.53961 |
Let's use Precision-Recall curve and see if we can find a better threshold
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.3
Checking model performance on training set
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.82814 | 0.53611 | 0.56443 | 0.54991 |
Let's check the performance on the test set
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.81823 | 0.48656 | 0.57460 | 0.52693 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold | |
|---|---|---|---|
| Accuracy | 0.85163 | 0.76103 | 0.82814 |
| Recall | 0.35373 | 0.68543 | 0.53611 |
| Precision | 0.76053 | 0.43077 | 0.56443 |
| F1 | 0.48287 | 0.52905 | 0.54991 |
# test performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold | |
|---|---|---|---|
| Accuracy | 0.82998 | 0.75951 | 0.81823 |
| Recall | 0.29301 | 0.67742 | 0.48656 |
| Precision | 0.72667 | 0.44840 | 0.57460 |
| F1 | 0.41762 | 0.53961 | 0.52693 |
Logistic Regression at threshold of 0.37 has the highest recall on training set and generalizes well on test set. The accuracy of prediction is also pretty high
Support Vector Machines (SVMs) are a powerful set of supervised learning methods that can be effectively used for both classification and regression tasks. We will try to use it for our classification problem and will see how it performs compared to the other models
df.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON | JOB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | DebtCon | Other |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | HomeImp | Office |
#Let's identify the categorical features
df.columns[df.dtypes == object]
Index(['REASON', 'JOB'], dtype='object')
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
Y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y )
x_train.shape, x_test.shape
((4768, 19), (1192, 19))
print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_test: ", y_test.shape)
Shape of x_train: (4768, 19) Shape of y_train: (4768,) Shape of x_test: (1192, 19) Shape of y_test: (1192,)
def print_score(clf, x_train, y_train, x_test, y_test, train=True):
if train:
print("Train Result:\n")
print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(x_train))))
print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(x_train))))
print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(x_train))))
res = cross_val_score(clf, x_train, y_train, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
elif train==False:
print("Test Result:\n")
print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(x_test))))
print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(x_test))))
print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(x_test))))
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
from sklearn import svm
clf = svm.SVC(kernel='rbf', gamma='auto')
clf.fit(x_train, y_train)
SVC(gamma='auto')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
SVC(gamma='auto')
confusion_matrix_sklearn(clf, x_train, y_train)
svm_perf_train = model_performance_classification_sklearn(
clf, x_train, y_train
)
svm_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96099 | 0.92429 | 0.88520 | 0.90432 |
confusion_matrix_sklearn(clf, x_test, y_test)
svm_perf_test = model_performance_classification_sklearn(
clf, x_test, y_test
)
svm_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83305 | 0.40756 | 0.62581 | 0.49364 |
df.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON | JOB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | DebtCon | Other |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | HomeImp | Office |
#Let's identify the categorical features
df.columns[df.dtypes == object]
Index(['REASON', 'JOB'], dtype='object')
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=1, shuffle = True )
X_train.shape, X_test.shape
((4172, 19), (1788, 19))
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (4172, 19) Shape of test set : (1788, 19) Percentage of classes in training set: 0 0.80417 1 0.19583 Name: BAD, dtype: float64 Percentage of classes in test set: 0 0.79195 1 0.20805 Name: BAD, dtype: float64
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeClassifier(random_state=1)
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86577 | 0.61290 | 0.70370 | 0.65517 |
Before pruning the tree let's check the important features.
Plotting the feature importance of each variable
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
min_samples_split=10, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
min_samples_split=10, random_state=1)decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86290 | 0.79437 | 0.61633 | 0.69412 |
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86290 | 0.79437 | 0.61633 | 0.69412 |
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.85235 | 0.74194 | 0.62162 | 0.67647 |
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- DEBTINC <= 34.82 | |--- DELINQ <= 1.50 | | |--- NINQ <= 4.50 | | | |--- CLNO <= 2.80 | | | | |--- weights: [22.38, 28.09] class: 1 | | | |--- CLNO > 2.80 | | | | |--- weights: [889.74, 107.24] class: 0 | | |--- NINQ > 4.50 | | | |--- CLNO <= 3.01 | | | | |--- weights: [9.33, 0.00] class: 0 | | | |--- CLNO > 3.01 | | | | |--- weights: [7.46, 25.53] class: 1 | |--- DELINQ > 1.50 | | |--- CLNO <= 3.60 | | | |--- DELINQ <= 2.50 | | | | |--- weights: [18.65, 15.32] class: 0 | | | |--- DELINQ > 2.50 | | | | |--- weights: [1.24, 33.19] class: 1 | | |--- CLNO > 3.60 | | | |--- DEROG <= 3.00 | | | | |--- weights: [23.01, 0.00] class: 0 | | | |--- DEROG > 3.00 | | | | |--- weights: [0.00, 7.66] class: 1 |--- DEBTINC > 34.82 | |--- DEBTINC <= 34.82 | | |--- DELINQ <= 0.50 | | | |--- CLAGE <= 178.10 | | | | |--- weights: [93.26, 561.71] class: 1 | | | |--- CLAGE > 178.10 | | | | |--- weights: [85.18, 155.75] class: 1 | | |--- DELINQ > 0.50 | | | |--- CLAGE <= 390.62 | | | | |--- weights: [36.68, 668.95] class: 1 | | | |--- CLAGE > 390.62 | | | | |--- weights: [1.87, 0.00] class: 0 | |--- DEBTINC > 34.82 | | |--- DEBTINC <= 43.75 | | | |--- CLAGE <= 202.30 | | | | |--- weights: [525.39, 268.09] class: 0 | | | |--- CLAGE > 202.30 | | | | |--- weights: [360.00, 35.75] class: 0 | | |--- DEBTINC > 43.75 | | | |--- CLAGE <= 285.54 | | | | |--- weights: [4.97, 176.17] class: 1 | | | |--- CLAGE > 285.54 | | | | |--- weights: [6.84, 2.55] class: 0
Plotting the feature importance of each variable
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations from decision tree
The rules obtained from the decision tree can be interpreted as:
If we want more complex then we can go in more depth of the tree
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.00000 | -0.00000 |
| 1 | 0.00000 | -0.00000 |
| 2 | 0.00000 | -0.00000 |
| 3 | 0.00000 | -0.00000 |
| 4 | 0.00000 | -0.00000 |
| ... | ... | ... |
| 253 | 0.00548 | 0.27333 |
| 254 | 0.00766 | 0.28099 |
| 255 | 0.00776 | 0.28875 |
| 256 | 0.03667 | 0.32542 |
| 257 | 0.08729 | 0.50000 |
258 rows × 2 columns
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08729057011909772
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
f1_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = f1_score(y_train, pred_train)
f1_train.append(values_train)
f1_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = f1_score(y_test, pred_test)
f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00038111933333653467,
class_weight='balanced', random_state=1)
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95901 | 1.00000 | 0.82692 | 0.90526 |
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87528 | 0.75538 | 0.68039 | 0.71592 |
Observations
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- DEBTINC <= 34.82 | |--- DELINQ <= 1.50 | | |--- NINQ <= 4.50 | | | |--- CLNO <= 2.80 | | | | |--- DEBTINC <= 22.28 | | | | | |--- weights: [13.06, 0.00] class: 0 | | | | |--- DEBTINC > 22.28 | | | | | |--- PROBINC <= 915.67 | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | |--- PROBINC > 915.67 | | | | | | |--- YOJ <= 3.43 | | | | | | | |--- DEROG <= 2.50 | | | | | | | | |--- weights: [0.62, 28.09] class: 1 | | | | | | | |--- DEROG > 2.50 | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | | | |--- YOJ > 3.43 | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | |--- CLNO > 2.80 | | | | |--- LOAN <= 17.25 | | | | | |--- CLAGE <= 172.84 | | | | | | |--- CLAGE <= 126.95 | | | | | | | |--- VALUE <= 10.86 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- VALUE > 10.86 | | | | | | | | |--- weights: [6.22, 0.00] class: 0 | | | | | | |--- CLAGE > 126.95 | | | | | | | |--- weights: [1.24, 15.32] class: 1 | | | | | |--- CLAGE > 172.84 | | | | | | |--- LOAN <= 17.13 | | | | | | | |--- weights: [19.27, 0.00] class: 0 | | | | | | |--- LOAN > 17.13 | | | | | | | |--- VALUE <= 10.79 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- VALUE > 10.79 | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | |--- LOAN > 17.25 | | | | | |--- DEBTINC <= 2.94 | | | | | | |--- weights: [0.00, 5.11] class: 1 | | | | | |--- DEBTINC > 2.94 | | | | | | |--- DEROG <= 8.00 | | | | | | | |--- YOJ <= 2.74 | | | | | | | | |--- MORTDUE <= 3.72 | | | | | | | | | |--- LOAN <= 19.91 | | | | | | | | | | |--- PROBINC <= 731.84 | | | | | | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | | | | | | | |--- PROBINC > 731.84 | | | | | | | | | | | |--- weights: [0.00, 12.77] class: 1 | | | | | | | | | |--- LOAN > 19.91 | | | | | | | | | | |--- DEBTINC <= 31.79 | | | | | | | | | | | |--- weights: [16.79, 0.00] class: 0 | | | | | | | | | | |--- DEBTINC > 31.79 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- MORTDUE > 3.72 | | | | | | | | | |--- VALUE <= 11.81 | | | | | | | | | | |--- CLAGE <= 338.47 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- CLAGE > 338.47 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- VALUE > 11.81 | | | | | | | | | | |--- JOB_Other <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- JOB_Other > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- YOJ > 2.74 | | | | | | | | |--- DEBTINC <= 33.61 | | | | | | | | | |--- LOAN <= 20.66 | | | | | | | | | | |--- LOAN <= 20.59 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- LOAN > 20.59 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- LOAN > 20.66 | | | | | | | | | | |--- JOB_Mgr <= 0.50 | | | | | | | | | | | |--- weights: [266.11, 0.00] class: 0 | | | | | | | | | | |--- JOB_Mgr > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- DEBTINC > 33.61 | | | | | | | | | |--- DEBTINC <= 33.70 | | | | | | | | | | |--- YOJ <= 2.89 | | | | | | | | | | | |--- weights: [0.62, 7.66] class: 1 | | | | | | | | | | |--- YOJ > 2.89 | | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | | |--- DEBTINC > 33.70 | | | | | | | | | | |--- DEBTINC <= 34.54 | | | | | | | | | | | |--- weights: [47.88, 0.00] class: 0 | | | | | | | | | | |--- DEBTINC > 34.54 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | |--- DEROG > 8.00 | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | |--- NINQ > 4.50 | | | |--- CLNO <= 3.01 | | | | |--- weights: [9.33, 0.00] class: 0 | | | |--- CLNO > 3.01 | | | | |--- YOJ <= 3.16 | | | | | |--- CLAGE <= 91.58 | | | | | | |--- VALUE <= 11.41 | | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | | | |--- VALUE > 11.41 | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- CLAGE > 91.58 | | | | | | |--- weights: [0.62, 22.98] class: 1 | | | | |--- YOJ > 3.16 | | | | | |--- weights: [3.73, 0.00] class: 0 | |--- DELINQ > 1.50 | | |--- CLNO <= 3.60 | | | |--- DELINQ <= 2.50 | | | | |--- NINQ <= 0.50 | | | | | |--- weights: [12.44, 0.00] class: 0 | | | | |--- NINQ > 0.50 | | | | | |--- REASON_HomeImp <= 0.50 | | | | | | |--- CLAGE <= 271.60 | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- CLAGE > 271.60 | | | | | | | |--- weights: [0.62, 2.55] class: 1 | | | | | |--- REASON_HomeImp > 0.50 | | | | | | |--- weights: [0.62, 12.77] class: 1 | | | |--- DELINQ > 2.50 | | | | |--- DEBTINC <= 34.57 | | | | | |--- weights: [0.00, 33.19] class: 1 | | | | |--- DEBTINC > 34.57 | | | | | |--- weights: [1.24, 0.00] class: 0 | | |--- CLNO > 3.60 | | | |--- DEROG <= 3.00 | | | | |--- weights: [23.01, 0.00] class: 0 | | | |--- DEROG > 3.00 | | | | |--- weights: [0.00, 7.66] class: 1 |--- DEBTINC > 34.82 | |--- DEBTINC <= 34.82 | | |--- DELINQ <= 0.50 | | | |--- CLAGE <= 178.10 | | | | |--- YOJ <= 3.54 | | | | | |--- YOJ <= 2.72 | | | | | | |--- YOJ <= 2.31 | | | | | | | |--- MORTDUE <= 3.62 | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | | | | |--- MORTDUE > 3.62 | | | | | | | | |--- weights: [1.87, 7.66] class: 1 | | | | | | |--- YOJ > 2.31 | | | | | | | |--- MORTDUE <= 3.78 | | | | | | | | |--- weights: [3.73, 97.02] class: 1 | | | | | | | |--- MORTDUE > 3.78 | | | | | | | | |--- MORTDUE <= 3.81 | | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | | | | | |--- MORTDUE > 3.81 | | | | | | | | | |--- DEROG <= 0.50 | | | | | | | | | | |--- CLAGE <= 175.40 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- CLAGE > 175.40 | | | | | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | | | | | | | | |--- DEROG > 0.50 | | | | | | | | | | |--- weights: [1.24, 40.85] class: 1 | | | | | |--- YOJ > 2.72 | | | | | | |--- CLNO <= 2.80 | | | | | | | |--- weights: [1.87, 51.06] class: 1 | | | | | | |--- CLNO > 2.80 | | | | | | | |--- MORTDUE <= 4.38 | | | | | | | | |--- LOAN <= 17.16 | | | | | | | | | |--- weights: [3.11, 48.51] class: 1 | | | | | | | | |--- LOAN > 17.16 | | | | | | | | | |--- MORTDUE <= 3.25 | | | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | | | | | | |--- MORTDUE > 3.25 | | | | | | | | | | |--- LOAN <= 17.57 | | | | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | | | | | | | |--- LOAN > 17.57 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | |--- MORTDUE > 4.38 | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | |--- YOJ > 3.54 | | | | | |--- weights: [1.87, 0.00] class: 0 | | | |--- CLAGE > 178.10 | | | | |--- YOJ <= 2.73 | | | | | |--- JOB_Office <= 0.50 | | | | | | |--- CLAGE <= 316.03 | | | | | | | |--- CLAGE <= 275.98 | | | | | | | | |--- YOJ <= 2.52 | | | | | | | | | |--- weights: [5.60, 45.96] class: 1 | | | | | | | | |--- YOJ > 2.52 | | | | | | | | | |--- CLNO <= 3.35 | | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | | |--- CLNO > 3.35 | | | | | | | | | | |--- VALUE <= 10.84 | | | | | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | | | | | | | | | |--- VALUE > 10.84 | | | | | | | | | | | |--- weights: [2.49, 20.43] class: 1 | | | | | | | |--- CLAGE > 275.98 | | | | | | | | |--- YOJ <= 2.48 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- YOJ > 2.48 | | | | | | | | | |--- LOAN <= 21.72 | | | | | | | | | | |--- NINQ <= 2.00 | | | | | | | | | | | |--- weights: [0.00, 5.11] class: 1 | | | | | | | | | | |--- NINQ > 2.00 | | | | | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | | | | | | | | |--- LOAN > 21.72 | | | | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | | | | | |--- CLAGE > 316.03 | | | | | | | |--- weights: [0.62, 22.98] class: 1 | | | | | |--- JOB_Office > 0.50 | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | |--- YOJ > 2.73 | | | | | |--- YOJ <= 3.07 | | | | | | |--- VALUE <= 10.44 | | | | | | | |--- weights: [0.00, 7.66] class: 1 | | | | | | |--- VALUE > 10.44 | | | | | | | |--- VALUE <= 12.24 | | | | | | | | |--- JOB_Mgr <= 0.50 | | | | | | | | | |--- weights: [28.60, 0.00] class: 0 | | | | | | | | |--- JOB_Mgr > 0.50 | | | | | | | | | |--- CLAGE <= 259.30 | | | | | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | | | | | | |--- CLAGE > 259.30 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- VALUE > 12.24 | | | | | | | | |--- VALUE <= 12.29 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | | |--- VALUE > 12.29 | | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | | |--- YOJ > 3.07 | | | | | | |--- CLNO <= 3.51 | | | | | | | |--- CLAGE <= 219.65 | | | | | | | | |--- weights: [9.33, 0.00] class: 0 | | | | | | | |--- CLAGE > 219.65 | | | | | | | | |--- CLAGE <= 272.90 | | | | | | | | | |--- YOJ <= 3.54 | | | | | | | | | | |--- weights: [1.24, 12.77] class: 1 | | | | | | | | | |--- YOJ > 3.54 | | | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | | | | | |--- CLAGE > 272.90 | | | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- CLNO > 3.51 | | | | | | | |--- JOB_Other <= 0.50 | | | | | | | | |--- NINQ <= 4.50 | | | | | | | | | |--- CLAGE <= 181.60 | | | | | | | | | | |--- weights: [0.62, 0.00] class: 0 | | | | | | | | | |--- CLAGE > 181.60 | | | | | | | | | | |--- YOJ <= 3.51 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- YOJ > 3.51 | | | | | | | | | | | |--- weights: [0.00, 15.32] class: 1 | | | | | | | | |--- NINQ > 4.50 | | | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | | | | | | |--- JOB_Other > 0.50 | | | | | | | | |--- VALUE <= 11.97 | | | | | | | | | |--- weights: [5.60, 0.00] class: 0 | | | | | | | | |--- VALUE > 11.97 | | | | | | | | | |--- weights: [0.00, 5.11] class: 1 | | |--- DELINQ > 0.50 | | | |--- CLAGE <= 390.62 | | | | |--- MORTDUE <= 4.64 | | | | | |--- weights: [34.82, 666.40] class: 1 | | | | |--- MORTDUE > 4.64 | | | | | |--- CLAGE <= 183.18 | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- CLAGE > 183.18 | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | |--- CLAGE > 390.62 | | | | |--- weights: [1.87, 0.00] class: 0 | |--- DEBTINC > 34.82 | | |--- DEBTINC <= 43.75 | | | |--- CLAGE <= 202.30 | | | | |--- VALUE <= 11.44 | | | | | |--- DEROG <= 1.50 | | | | | | |--- LOAN <= 21.45 | | | | | | | |--- DEBTINC <= 40.19 | | | | | | | | |--- YOJ <= 2.35 | | | | | | | | | |--- weights: [16.79, 0.00] class: 0 | | | | | | | | |--- YOJ > 2.35 | | | | | | | | | |--- PROBINC <= 912.38 | | | | | | | | | | |--- CLNO <= 3.28 | | | | | | | | | | | |--- weights: [0.62, 12.77] class: 1 | | | | | | | | | | |--- CLNO > 3.28 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- PROBINC > 912.38 | | | | | | | | | | |--- VALUE <= 11.23 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- VALUE > 11.23 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | |--- DEBTINC > 40.19 | | | | | | | | |--- YOJ <= 2.92 | | | | | | | | | |--- CLAGE <= 170.44 | | | | | | | | | | |--- NINQ <= 3.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- NINQ > 3.50 | | | | | | | | | | | |--- weights: [0.00, 7.66] class: 1 | | | | | | | | | |--- CLAGE > 170.44 | | | | | | | | | | |--- weights: [0.62, 17.87] class: 1 | | | | | | | | |--- YOJ > 2.92 | | | | | | | | | |--- JOB_ProfExe <= 0.50 | | | | | | | | | | |--- YOJ <= 3.42 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- YOJ > 3.42 | | | | | | | | | | | |--- weights: [2.49, 0.00] class: 0 | | | | | | | | | |--- JOB_ProfExe > 0.50 | | | | | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | | | |--- LOAN > 21.45 | | | | | | | |--- LOAN <= 24.31 | | | | | | | | |--- CLAGE <= 53.21 | | | | | | | | | |--- CLNO <= 2.86 | | | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | | | | | | |--- CLNO > 2.86 | | | | | | | | | | |--- weights: [0.00, 5.11] class: 1 | | | | | | | | |--- CLAGE > 53.21 | | | | | | | | | |--- VALUE <= 11.43 | | | | | | | | | | |--- CLAGE <= 200.84 | | | | | | | | | | | |--- weights: [70.26, 0.00] class: 0 | | | | | | | | | | |--- CLAGE > 200.84 | | | | | | | | | | | |--- weights: [0.62, 2.55] class: 1 | | | | | | | | | |--- VALUE > 11.43 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- LOAN > 24.31 | | | | | | | | |--- CLNO <= 3.53 | | | | | | | | | |--- weights: [0.00, 10.21] class: 1 | | | | | | | | |--- CLNO > 3.53 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | |--- DEROG > 1.50 | | | | | | |--- MORTDUE <= 4.04 | | | | | | | |--- weights: [2.49, 25.53] class: 1 | | | | | | |--- MORTDUE > 4.04 | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | |--- VALUE > 11.44 | | | | | |--- LOAN <= 25.00 | | | | | | |--- DELINQ <= 5.00 | | | | | | | |--- CLAGE <= 173.52 | | | | | | | | |--- DEROG <= 0.50 | | | | | | | | | |--- YOJ <= 2.86 | | | | | | | | | | |--- CLNO <= 3.31 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- CLNO > 3.31 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- YOJ > 2.86 | | | | | | | | | | |--- JOB_Mgr <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- JOB_Mgr > 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- DEROG > 0.50 | | | | | | | | | |--- JOB_ProfExe <= 0.50 | | | | | | | | | | |--- LOAN <= 22.76 | | | | | | | | | | | |--- weights: [0.62, 10.21] class: 1 | | | | | | | | | | |--- LOAN > 22.76 | | | | | | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | | | | | | |--- JOB_ProfExe > 0.50 | | | | | | | | | | |--- weights: [4.35, 0.00] class: 0 | | | | | | | |--- CLAGE > 173.52 | | | | | | | | |--- weights: [72.12, 0.00] class: 0 | | | | | | |--- DELINQ > 5.00 | | | | | | | |--- weights: [0.00, 7.66] class: 1 | | | | | |--- LOAN > 25.00 | | | | | | |--- JOB_Self <= 0.50 | | | | | | | |--- weights: [0.00, 12.77] class: 1 | | | | | | |--- JOB_Self > 0.50 | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | |--- CLAGE > 202.30 | | | | |--- DELINQ <= 1.50 | | | | | |--- CLAGE <= 795.72 | | | | | | |--- VALUE <= 12.55 | | | | | | | |--- JOB_Sales <= 0.50 | | | | | | | | |--- YOJ <= 2.35 | | | | | | | | | |--- LOAN <= 22.14 | | | | | | | | | | |--- weights: [17.41, 0.00] class: 0 | | | | | | | | | |--- LOAN > 22.14 | | | | | | | | | | |--- LOAN <= 22.75 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- LOAN > 22.75 | | | | | | | | | | | |--- weights: [5.60, 0.00] class: 0 | | | | | | | | |--- YOJ > 2.35 | | | | | | | | | |--- MORTDUE <= 4.54 | | | | | | | | | | |--- weights: [300.93, 0.00] class: 0 | | | | | | | | | |--- MORTDUE > 4.54 | | | | | | | | | | |--- YOJ <= 2.94 | | | | | | | | | | | |--- weights: [16.79, 0.00] class: 0 | | | | | | | | | | |--- YOJ > 2.94 | | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- JOB_Sales > 0.50 | | | | | | | | |--- CLNO <= 3.85 | | | | | | | | | |--- weights: [5.60, 0.00] class: 0 | | | | | | | | |--- CLNO > 3.85 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | |--- VALUE > 12.55 | | | | | | | |--- LOAN <= 24.90 | | | | | | | | |--- weights: [3.11, 0.00] class: 0 | | | | | | | |--- LOAN > 24.90 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- CLAGE > 795.72 | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | |--- DELINQ > 1.50 | | | | | |--- YOJ <= 2.52 | | | | | | |--- DELINQ <= 3.50 | | | | | | | |--- weights: [5.60, 0.00] class: 0 | | | | | | |--- DELINQ > 3.50 | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | |--- YOJ > 2.52 | | | | | | |--- VALUE <= 11.82 | | | | | | | |--- weights: [2.49, 17.87] class: 1 | | | | | | |--- VALUE > 11.82 | | | | | | | |--- weights: [1.24, 0.00] class: 0 | | |--- DEBTINC > 43.75 | | | |--- CLAGE <= 285.54 | | | | |--- DEBTINC <= 44.57 | | | | | |--- CLAGE <= 238.43 | | | | | | |--- weights: [2.49, 20.43] class: 1 | | | | | |--- CLAGE > 238.43 | | | | | | |--- weights: [1.87, 0.00] class: 0 | | | | |--- DEBTINC > 44.57 | | | | | |--- weights: [0.62, 155.75] class: 1 | | | |--- CLAGE > 285.54 | | | | |--- DEBTINC <= 46.62 | | | | | |--- weights: [6.84, 0.00] class: 0 | | | | |--- DEBTINC > 46.62 | | | | | |--- weights: [0.00, 2.55] class: 1
Plotting the feature importance of each variable
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="Blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations from tree
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 1.00000 | 0.86290 | 0.95901 |
| Recall | 1.00000 | 0.79437 | 1.00000 |
| Precision | 1.00000 | 0.61633 | 0.82692 |
| F1 | 1.00000 | 0.69412 | 0.90526 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 0.86577 | 0.85235 | 0.87528 |
| Recall | 0.61290 | 0.74194 | 0.75538 |
| Precision | 0.70370 | 0.62162 | 0.68039 |
| F1 | 0.65517 | 0.67647 | 0.71592 |
Observations
df.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON | JOB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | HomeImp | Other |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | DebtCon | Other |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | HomeImp | Office |
#Let's identify the categorical features
df.columns[df.dtypes == object]
Index(['REASON', 'JOB'], dtype='object')
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
df_encoded.head()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | PROBINC | BAD | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.10492 | 3.56105 | 10.57221 | 3.02042 | 0.00000 | 0.00000 | 94.36667 | 1.00000 | 2.94444 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.10492 | 4.03347 | 11.13327 | 2.83321 | 0.00000 | 2.00000 | 121.83333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 17.10492 | 3.28316 | 9.72376 | 2.63906 | 0.00000 | 0.00000 | 149.46667 | 1.00000 | 2.99573 | 34.81826 | 1975.70831 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 17.10492 | 3.99604 | 11.39915 | 2.83321 | 0.00000 | 0.00000 | 173.46667 | 1.00000 | 3.40120 | 34.81826 | 1975.70831 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 17.10492 | 4.20526 | 11.62634 | 2.56495 | 0.00000 | 0.00000 | 93.33333 | 0.00000 | 3.17805 | 34.81826 | 1975.70831 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
Y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y )
x_train.shape, x_test.shape
((4768, 19), (1192, 19))
Over Sampling Using SMOTE
sm = SMOTE(random_state=12)
x_train_r, y_train_r = sm.fit_resample(x_train, y_train)
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf.fit(x_train_r, y_train_r)
RandomForestClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(random_state=42)
confusion_matrix_sklearn(clf_rf, x_train, y_train)
rf_perf_train = model_performance_classification_sklearn(
clf_rf, x_train, y_train
)
rf_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
confusion_matrix_sklearn(clf_rf, x_test, y_test)
rf_perf_test = model_performance_classification_sklearn(
clf_rf, x_test, y_test
)
rf_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.91107 | 0.76471 | 0.78448 | 0.77447 |
y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
# Splitting the data into train and test sets in 70:30 ratio
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state=1, shuffle = True )
x_train.shape, x_test.shape
((4172, 19), (1788, 19))
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model. The model_performance_classification_statsmodels function will be used to check the model performance of models. The confusion_matrix_statsmodels function will be used to plot the confusion matrix.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
rf_classifier=RandomForestClassifier(random_state=42)
rf_classifier.fit(x_train,y_train)
RandomForestClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(random_state=42)
rf_classifier_model_train_perf = model_performance_classification_sklearn(rf_classifier, x_train,y_train)
print("Training performance \n",rf_classifier_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.00000 1.00000 1.00000 1.00000
rf_classifier_model_test_perf = model_performance_classification_sklearn(rf_classifier, x_test,y_test)
print("Testing performance \n",rf_classifier_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.91163 0.68817 0.85906 0.76418
print(pd.DataFrame(rf_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
Imp DEBTINC 0.18387 PROBINC 0.11402 CLAGE 0.09490 DELINQ 0.08407 VALUE 0.08254 MORTDUE 0.07885 LOAN 0.07853 CLNO 0.07212 YOJ 0.05547 DEROG 0.05260 NINQ 0.04228 REASON_HomeImp 0.00923 JOB_Office 0.00907 JOB_Other 0.00887 REASON_DebtCon 0.00868 JOB_ProfExe 0.00867 JOB_Mgr 0.00710 JOB_Sales 0.00529 JOB_Self 0.00383
feature_names = x_train.columns
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from sklearn.ensemble import AdaBoostClassifier
ab_classifier=AdaBoostClassifier(n_estimators=100, learning_rate=0.2, random_state=4)
ab_classifier.fit(x_train,y_train)
AdaBoostClassifier(learning_rate=0.2, n_estimators=100, random_state=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AdaBoostClassifier(learning_rate=0.2, n_estimators=100, random_state=4)
ab_classifier_model_train_perf = model_performance_classification_sklearn(ab_classifier, x_train,y_train)
print("Training performance \n",ab_classifier_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.89334 0.56671 0.83574 0.67542
ab_classifier_model_test_perf = model_performance_classification_sklearn(ab_classifier, x_test,y_test)
print("Testing performance \n",ab_classifier_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.88647 0.54032 0.86266 0.66446
print(pd.DataFrame(ab_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
Imp DEBTINC 0.36000 DELINQ 0.14000 CLAGE 0.10000 CLNO 0.07000 PROBINC 0.07000 DEROG 0.06000 VALUE 0.04000 NINQ 0.04000 LOAN 0.03000 YOJ 0.03000 MORTDUE 0.02000 JOB_Office 0.02000 JOB_Sales 0.02000 REASON_DebtCon 0.00000 REASON_HomeImp 0.00000 JOB_Mgr 0.00000 JOB_Other 0.00000 JOB_ProfExe 0.00000 JOB_Self 0.00000
feature_names = x_train.columns
importances = ab_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from sklearn.ensemble import GradientBoostingClassifier
gb_classifier=GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, random_state=4)
gb_classifier.fit(x_train,y_train)
GradientBoostingClassifier(learning_rate=0.2, random_state=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GradientBoostingClassifier(learning_rate=0.2, random_state=4)
gb_classifier_model_train_perf = model_performance_classification_sklearn(gb_classifier, x_train,y_train)
print("Training performance \n",gb_classifier_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.94535 0.78703 0.92253 0.84941
gb_classifier_model_test_perf = model_performance_classification_sklearn(gb_classifier, x_test,y_test)
print("Testing performance \n",gb_classifier_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.90940 0.66667 0.86713 0.75380
print(pd.DataFrame(gb_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
Imp DEBTINC 0.31110 PROBINC 0.18286 DELINQ 0.13044 DEROG 0.07112 CLAGE 0.07053 CLNO 0.04807 VALUE 0.04442 MORTDUE 0.04226 LOAN 0.03072 YOJ 0.02787 NINQ 0.02290 JOB_Office 0.00406 JOB_Sales 0.00405 JOB_Other 0.00308 JOB_Mgr 0.00263 JOB_ProfExe 0.00180 REASON_HomeImp 0.00152 REASON_DebtCon 0.00057 JOB_Self 0.00000
feature_names = x_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show();
from xgboost import XGBClassifier
xgb_classifier=XGBClassifier(random_state=1, verbosity = 0)
xgb_classifier.fit(x_train,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=1, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=1, ...)xgb_classifier_model_train_perf = model_performance_classification_sklearn(xgb_classifier, x_train, y_train)
print("Training performance \n",xgb_classifier_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.99832 0.99388 0.99754 0.99571
xgb_classifier_model_test_perf = model_performance_classification_sklearn(xgb_classifier, x_test, y_test)
print("Testing performance \n",xgb_classifier_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.91387 0.69624 0.86333 0.77083
print(pd.DataFrame(xgb_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
Imp DEBTINC 0.21867 PROBINC 0.14010 DELINQ 0.12172 DEROG 0.07227 JOB_Sales 0.05035 JOB_Self 0.04820 CLAGE 0.03957 JOB_Mgr 0.03616 NINQ 0.03152 JOB_Office 0.03132 CLNO 0.03057 YOJ 0.02938 VALUE 0.02908 MORTDUE 0.02804 LOAN 0.02752 REASON_DebtCon 0.02740 JOB_ProfExe 0.02386 JOB_Other 0.01427 REASON_HomeImp 0.00000
feature_names = x_train.columns
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show();
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
svm_perf_train.T,
rf_perf_train.T,
rf_classifier_model_train_perf.T,
ab_classifier_model_train_perf.T,
gb_classifier_model_train_perf.T,
xgb_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
"Support Vector Merchine",
"Random Forest (resampled)",
"Bagging",
"Ada Boost","Gradient Boost", "XG Boost",
]
# test set performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
svm_perf_test.T,
rf_perf_test.T,
rf_classifier_model_test_perf.T,
ab_classifier_model_test_perf.T,
gb_classifier_model_test_perf.T,
xgb_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
"Support Vector Merchine",
"Random Forest (resampled)",
"Bagging",
"Ada Boost","Gradient Boost", "XG Boost",
]
models_train_comp_df.T
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression-default Threshold | 0.85163 | 0.35373 | 0.76053 | 0.48287 |
| Logistic Regression-0.37 Threshold | 0.76103 | 0.68543 | 0.43077 | 0.52905 |
| Logistic Regression-0.42 Threshold | 0.82814 | 0.53611 | 0.56443 | 0.54991 |
| Decision Tree sklearn | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| Decision Tree (Pre-Pruning) | 0.86290 | 0.79437 | 0.61633 | 0.69412 |
| Decision Tree (Post-Pruning) | 0.95901 | 1.00000 | 0.82692 | 0.90526 |
| Support Vector Merchine | 0.96099 | 0.92429 | 0.88520 | 0.90432 |
| Random Forest (resampled) | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| Bagging | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| Ada Boost | 0.89334 | 0.56671 | 0.83574 | 0.67542 |
| Gradient Boost | 0.94535 | 0.78703 | 0.92253 | 0.84941 |
| XG Boost | 0.99832 | 0.99388 | 0.99754 | 0.99571 |
models_test_comp_df.T
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression-default Threshold | 0.82998 | 0.29301 | 0.72667 | 0.41762 |
| Logistic Regression-0.37 Threshold | 0.75951 | 0.67742 | 0.44840 | 0.53961 |
| Logistic Regression-0.42 Threshold | 0.81823 | 0.48656 | 0.57460 | 0.52693 |
| Decision Tree sklearn | 0.86577 | 0.61290 | 0.70370 | 0.65517 |
| Decision Tree (Pre-Pruning) | 0.85235 | 0.74194 | 0.62162 | 0.67647 |
| Decision Tree (Post-Pruning) | 0.87528 | 0.75538 | 0.68039 | 0.71592 |
| Support Vector Merchine | 0.83305 | 0.40756 | 0.62581 | 0.49364 |
| Random Forest (resampled) | 0.91107 | 0.76471 | 0.78448 | 0.77447 |
| Bagging | 0.91163 | 0.68817 | 0.85906 | 0.76418 |
| Ada Boost | 0.88647 | 0.54032 | 0.86266 | 0.66446 |
| Gradient Boost | 0.90940 | 0.66667 | 0.86713 | 0.75380 |
| XG Boost | 0.91387 | 0.69624 | 0.86333 | 0.77083 |
The nature of predictions made by the models will translate as follows:
True positives (TP) are defaults correctly predicted by the model.
False negatives (FN) are defaulted clients in reality, which the model has predicted not-defaulted
False positives (FP)are not-defaulted clients in reality, which the model has predicted, defaulted
Model can make wrong predictions as:
Predicting a default but in reality the client has not defaulted .
Predicting a Not-defaulted but in reality the client has defaulted
Which case is more important?
If we predict a client will default and in actuallity they do not, for banks this doesnt particularly hurt so much - no real financial loss.
If, on the other hand, we predict the client will not default and an it does default in reality, this will be quite hurtful to the bank leading financial loss i.e loan write off, which directly impact bottom line
To reduce this loss Recall should be maximized (Need to reduce False Negative), the greater the recall the higher the chances of identifying both the classes correctly.
Random Forest (resampled) with the highest recall of 0.76 and model accuracy of 91% and Decision tree with equally high recall are the models of interest
Amongst all the models we tried in this case Random forest(resampled) has the best recall on test score of 76% and overal model accuracy of 91%
feature_names = x_train.columns
importances = clf_rf.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The three most important feature that determine wether a client will pay or not are DEBTIN,DELINQ and PROBINC
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due)
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly
PROBINC is (current debt on mortgage)/(debt to income ratio).